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Abstract 

In this paper, Bayesian parameter estimation through the consideration of the Maximum A Posteriori (MAP) 
criterion is revisited under the prism of the Expectation-Maximization (EM) algorithm. By incorporating a 
sparsity-promoting penalty term in the cost function of the estimation problem through the use of an appropriate 
prior distribution, we show how the EM algorithm can be used to efficiently solve the corresponding optimization 
problem. To this end, we rely on variance-mean Gaussian mixtures (VMGM) to describe the prior distribution, 
while we incorporate many nice features of these mixtures to our estimation problem. The corresponding 
MAP estimation problem is completely expressed in terms of the EM algorithm, which allows for handling 
nonlinearities and hidden variables that cannot be easily handled with traditional methods. For comparison 
purposes, we also develop a Coordinate Descent algorithm for the f^-norm penalized problem and present the 
performance results via simulations. 


1 Introduction 

In engineering, an overwhelming number of applications deal with relating a random response variable y with a 

set of explanatory variables or covariates X = (xi, x 2 , ■ ■ ., x n ) through a regression model. The most usual case 

is that of the linear regression, namely ... 

y = X6 + e, (1) 

where y is the nx 1 response vector, X is the n x p regression matrix, 6 is the px 1 parameter vector and e 

is the n x 1 noise vector. The traditional way to perform parameter estimation in this setup is by considering 

the Maximum Likelihood (ML) approach. The corresponding estimates can be iteratively computed based on the 

Expectation-Maximization (EM) algorithm [7], when the observations are seen as incomplete data, see, e.g., [17]. 

On the other hand, when the parameter vector 0 is sparse, i.e., many of its entries are equal to zero, it is common 

to include a sparsity promoting penalty to the ML problem, yielding a regularized ML estimation problem. Those 

sparsity promoting penalties are usually the U-norm, the £ 9 -norm (0 < q < 1) and the £q pseudo-norm [8,9]. 

Depending on the penalty, different techniques are utilized to solve this regularized ML problem, such as proximal 

methods (PM), see, e.g., [2], or the local linear approximation (LLA) and the local quadratic approximation (LQA) 
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methods [10,25]. Another optimization approach for penalized problems corresponds to the implementation of the 
Coordinate Descent (CD) algorithm [24]. This technique has been successfully applied to the Lasso problem [23]. 
Its extension to non-convex penalties was introduced in [5]. Its particular application to sparsity problems with the 
f^-norm penalty and its convergence is analysed in [16]. In CD, the vector parameter estimates are obtained in a per 
coordinate base. Within the EM framework, the application of the CD approach yields the so-called Expectation- 
Conditional Maximum (ECM) algorithm [17]. In this paper, we show how to extend the ECM algorithm to £ g -nomi 
penalized ML problems with hidden variables using the CD method. 

A different approach corresponds to turn the penalty function in the form of a probability density function 
(pdf), accounting for some prior knowledge of the vector parameter 6. Thus, we can turn the ML estimation 
problem into a Maximum a Posteriori (MAP) problem. If the observations correspond to incomplete data , the EM 
algorithm can be easily modified to produce the posterior mode of G instead of the ML estimate of 6. Denoting 
by p(y\G) the response sampling density, the corresponding log-likelihood function is C{9) = \ogp{y\G). For ML 
estimation, the EM algorithm iteratively selects a value 9* that maximizes C{9). In the case of MAP estimation, 
the function that is maximized by the EM algorithm is jC(9) + G(9), where G{9) = log p{9) is the log-prior of the 
parameter vector G. From these observations, it becomes clear that the EM algorithm is a quite general tool for 
solving either classical or Bayesian estimation problems. 

In this paper, we formalize an EM-based framework for regularized ML parameter estimation using a MAP 
perspective. Motivated by the recent advances in regularized parameter estimation and in compressed sensing, 
we propose a way to introduce the l q —norm sparsity-promoting penalty into our estimation problem by properly 
selecting the prior through the use of variance-mean Gaussian mixtures (VMGM) [3]. Different approaches have 
been considered in the literature using VMGM. In [11], the £\-norm penalty is considered for promoting sparsity, 
which is interpreted as a Laplace prior for a MAP interpretation of the t\ penalty using the EM algorithm. For 
robust regression, VMGMs can be utilized to express a least £ q regression as a ML estimation problem, which 
is then solved using the EM algorithm [14], The convergence properties of the approach in [14] solved by using 
iterative re-weighted least squares were presented in [1], A general approach for MAP estimation is considered 
in [18], where the main idea for utilizing the VMGMs and EM algorithm for the prior pdf was extended to different 
VMGMs. This idea was further extended in [12] for sparse parameter vectors with quantized data, which in turn 
is a generalization of the work in [6]. In this paper, we provide all the relevant theory for the formulation of the 
EM algorithm, when VMGMs are employed, for hidden variables in both the likelihood function and the prior pdf 
tailored to the f^-norm, 0 < q < 1. To illustrate the effectiveness of our proposal, we present an example with both 
hidden variables and nonlinearities and compare the results with utilizing the £\- and the t g -norms with both the 
CD and the EM-based MAP approaches. 

The remainder of the paper is organized as follows: Section 2 introduces the basic EM definitions for the ML, 
penalized ML and MAP estimation problems. The derivation of the ECM algorithm with the £ g -norm penalty is 
included. Section 3 introduces the basic notion of mixtures for expressing a prior distribution and provides all the 
necessary expressions for the formulation of the EM algorithm, when the prior is expressed as a VMGM. Section 
4 presents the VMGM expressions for the i q —norm penalty. Finally, simulation results are provided in Section 5, 
while Section 6 concludes the paper. 

2 Basic Description of the EM algorithm 

The EM algorithm is basically formulated to iteratively solve the ML estimation problem: 

0ml = argmax£(0). (2) 

0 

Using the notion of a hidden variable z [17], we can easily express the log-likelihood function as follows: 

C{G) = Qml ( G , - H M l ( 6 , , (3) 
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where 6 is the estimate of 6 at the ith iteration, and 


Qml = J logp(z,y\G)p (z\y,6 {,) ^j dz, 

H M l (0,9 <,] ^J = f logp(z\y,G)p (z|y,0 ( °) dz. 


Using Jensen’s inequality [4], one can readily see that [7] 


Hml {g,G ^ < Hml 


§ W ,6 W 


for any 6. Therefore, the EM algorithm for the ML estimation problem can be formulated as follows [17]: 


E-step: Compute 
M-step: Solve 


Qml ( 0 , G ( '^ = E logp(z, y\G)\y ,! 


UU1VC n ( a A(i)\ 

6 = argmax Q M l [G,6 J . 

Under some general conditions, the above algorithm converges to the ML estimate of G [17]. 


2.1 Penalized ML estimation via the EM algorithm 

Many regularized estimation problems involve the optimization of a cost function of the form 

U(6) = C(G) - g{6), (4) 

where g(- ) is a penalty function. If we consider that C(G) is the log-likelihood function, then this is a penalized 
ML estimation problem. For the attainment of the estimates, different approaches can be used, depending on the 
nature of the penalty function. In our case, in order to promote sparsity on 6, a popular choice for the penalty 
function is g(G) = ^ ||0|j'[ with 0 < q < 1. When q = 1, we obtain the Lasso [21]. In order to solve the penalized 
ML estimation problem, the modifications to the EM algorithm are implemented on the M-step, while the E-step 
remains the same as before [17, Chap. 5.18]. Hence, the modified EM algorithm for penalized ML estimation is 
given as follows: 

E-step: Compute , r , ,, 


.Cj-SsLUp. V_yUllipLlLtJ / r /^-i 

Qml [d,6 j = E log p(z,y\G)\y,G 

M-step: Solve „( i+: n , 

G = argmax Q M l [0,6 J - g{G). 

In order to understand why traditional methods cannot be applied straightforwardly, let us consider a linear 
regression as in (1), but where the matrix X is a function of a hidden variable, e.g., an unknown state in a 
state-space representation. In such a case, when performing the E-step, we need to compute Ei = E[X\y\ and 
E 2 = E[X T X\y\. In general, methods that do not consider hidden variables rely on the relationship between 
the matrix X and X T X. However, in our case, E 2 7 ^ E^E 1 , and hence traditional methods should be modified 
accordingly. 


2.1.1 The ECM algorithm for f 9 -norm penalized ML estimation 

For non-convex penalties, a CD algorithm [24] (also called Cyclic Descent) was first proposed and analysed in [5] and 
specifically applied to the £ 9 -norm penalty as a sparsity inducing norm in [16]. The basic idea behind [5] and [16] 
is to optimize the penalized ML problem in a coordinate-wise manner. This implies computing the gradient of the 
log-likelihood function with respect to one component of 6 and solving for this component, while the remaining 
components are considered constants. This is implemented for each component of the parameter vector G in a 
cyclic manner until convergence. In our analysis, the system model allows for obtaining an auxiliary function 
Qml ( G , 6 { ^ that is quadratic with respect to G. To see this, let us first assume that the regressor matrix X is 
a function of a hidden variable z, X = f[z), and that the measurement noise is zero mean Gaussian distributed 
with covariance matrix S £ . With these assumptions, we have: 

Qml (M W ) = (y T ^ 1 y-2y T H- 1 E[X\y,G {i) ]6 

+ G T E[X T E; 1 X\y,G ii) }d) , 
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which is clearly a quadratic function of 9. Then, we compute the gradient with respect to only one component of 
9. For simplicity, let us define E\ = E[X\y,$ *] and Eo = E[X T S ^ 1 X\y , 9^ ]. With these we obtain 


d 


86, 


Qml(M ) = {[E.jj^y-qjE.e), (5) 

where [E\\j is the jth column of E\ and is the jth column of the identity matrix. If we solve for 9j , we obtain 

Since we are interested in the penalized ML problem, we include the penalty for the jth component of 9 and solve 
for 


o(*+ 1 ) 

y) = arg max 


?>“2 C 


1 / ~(*+ 1 ) fl (*+ l )^ 2 


i(i+l) 


T E 


2j j,j 


using the scalar optimization algorithm described in [16]. 


2.2 MAP estimation via the EM algorithm 

Let us consider now the case of the MAP estimation. Based on Bayes’ theorem, the log-posterior pdf of 0 can be 
expressed as: 

\ogp{B\y)=C(B) + G{B)-\ogp{y). ( 6 ) 

Using (3), ( 6 ) and the fact that Hml (b,6^ ^ is a decreasing function of 9, the EM algorithm for the MAP 
estimation problem is formulated as follows: 

E-step: Compute 

Smap (M 10 ) =E[p(z,y\9)\yJ il) ] +G(9). 

M-step: Solve 

0 (!+1) = arg max <2 MAP . 

In the next section, we will see how we can express the log-prior so that the EM algorithm can be used to tackle 
the £ q —norm regularized ML estimation problem. 


3 MAP: The prior pdf as a mixture of distributions 


Focusing on the prior distribution for MAP estimation, it is possible to express p{9) as a marginal probability 
density function (pdf), occurring by integrating out the rest of the variables in a possible joint pdf. In this sense, 
one can express p{9) as an infinite mixture, where there is an underlying process that generates the desired pdf for 
the parameters. More precisely: 

P (9) = J p(9\\)p(X)dX, (7) 

where A G IR r is interpreted as a hidden (hyper)-parameter vector. If the conditional pdf p(9\X) is Gaussian, then 
we obtain a Gaussian mixture, see, e.g., [3,13,22]. Hence, we can express the log-prior as 


log p(9) = Qprior (0, 0 W ) - 7/ prior (b, 9 W ) ( 8 ) 

with 

Qprior (M W ) = Jl0gp(9,X)p (A|0 W )dA, 

Qprior (M^) = j l0gp(X\9)p (A|0 (O )dA. 


The auxiliary function H pr i or (d, 9^ ^ is decreasing in 9 1 . 
follows: 

1 In the same way as 'H (6 , 6^ *). 


Therefore, the EM algorithm can be formulated as 
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E-step: Compute 


M-step: Solve 


QmAP-EM (o,0 ) = QmL ( Q Q ( ) + Qprior (o, O' ) • 

0 (l+1) = arg max Qmap-em . 

For the M-step of the last EM algorithm, we need to evaluate the gradient of Qmap-em ^ 0,6^ ' j with respect 
to 0. Assuming that VaQml ^ 0,0^ ^ is known from the ML estimation problem, we only need to evaluate 
VeQprior ( 0 , 0 ( * } ). 

Lemma 1 When expressing the prior pdf of 0 , p{0), as a VMGM and letting 0|A ~ A/"(/x e (A), Ee(A)), the gradient 
of the auxiliary function Q pr ior {o,0' is given by 

V e G(0) 


«<*>= ^[“^(A)]^ 

-U 


+E xlS w [S e 1 (A) / x e (A)] . 


(9) 


Proof: First, notice that the marginal of A is independent of Q. Therefore, 

V e Q prior (0J (t) ) = J S7 e logp(0\X)p (A|0 (i) )dA. 

Given that 0 (A ~ A/"(/r(A), Sg(A)), we obtain: 

V e Q pr ior (0 J (i) )= J [-E e 1 (A)(6»-/x(A))]p(A|0 ti) ) dA 

[—^'e 1 (A)] 0 

+ £ A|g(i) [S^(A)p(A))]. (10) 

The expected values in the last equation can be computed based on the Gaussianity assumption for 0|A. Due to 
this assumption, p(0) satisfies the following equation: 


V e p(0) = J [-E e 1 (A)( 0 -p(A))]p(ejA)p(A)dA. 


On the other hand, Bayes rule yields: 


PW 


Plugging in the last expression into (11), we obtain: 


Pimm 

P(0 1 A) 


V e p(0)= J [-E e 1 (A)( 0 - / x(A))]p(A| 0 )p( 0 )dA. 


( 11 ) 


Therefore, 

W) sl6P{e) = Exle [_Sel(A)(0 ■ ^ (A)) ] • (12) 

We now note that (12) equals G(0) = Vglogp(0). Nevertheless, G(0) is known and so is G(0). Therefore, (9) 
follows. □ 


The expression in (9) corresponds to a linear system of equations that can be generally solved, because p ^A|0 l ^ 
is a known distribution. Thus, the expected values can be computed either in closed form when possible or using 
Monte Carlo techniques, such as the Metropolis-Hastings algorithm or the Gibbs sampler [19]. A particular case 
arises when G(0) can be expressed as a function of the coordinate variables of Q as follows: 

G(0)=f>(^). (13) 

3 =1 V ' 


Here, r is a factor controlling the strength of the regularization. Different selections of the kernel function k(-) 
lead to different regularizations. Usual regularization choices can be obtained based on Gaussian mixtures, e.g., 
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the Ridge regression for n(0j/T ) = (9j/t) 2 and the Lasso for k(9j/t) = \0j/r\ [12]. The general form of a VMGM 
for a scalar parameter Oj is given by 

/>oo 

p{6j) = / A/ - (nj + Xiij,T 2 X) p(X)d\. (14) 

Jo 

Based on this formula, the Ridge regression is obtained for Uj = fij = 0 and A = 1 and the Lasso is obtained for 
u : j = fij = 0 and A exponentially distributed. Clearly, these two very usual regularizations are exactly expressed 
as VMGMs. For the cases where pij = 0, the following lemma holds: 


Lemma 2 When pj 


Proof: See [12]. 


0,j = 1,2, then 

aQpHor ( [e,e {i) ) 

dOj 


i 



n(dj) 


le,-=e ( . 0 Oj- 


(15) 

□ 


In the next section, we present in detail the VMGM descriptions of the i q — norm regularization term. 


4 Sparse Parameter Estimation using VMGMs 


As it has already become apparent, the penalties that are usually considered for promoting sparsity can be inter¬ 
preted as MAP problems. In this section, we explore the representation of the t q — norm as a prior distribution 
function for developing a sparse-promoting algorithm under the utilization of VMGMs. To this end, the density 
function that relates to the i q —norm is given by 


p 

Pe q ( 9 ) = IT (0j)> 

j =i 

Pi q i 9 j) = 


2 r 


( 1 + i) 


_ e ~fq\8j\ q 


(16) 

(17) 


which corresponds to a leptokurtic exponential power distribution [20]. The prior pe. q (0j) in (17) can be expressed 
as a VMGM, with p(Xj) being a stable distribution in (14). Then, the MAP-EM algorithm can be used to solve 
the aforementioned sparsity-promoting estimation problems. 


4.1 MAP-EM algorithm for sparse parameter estimation 


In this section, we present the relevant expressions in the development of the MAP-EM algorithm for sparse- 
parameter estimation using the i q — norm. 


Lemma 3 In the case that the log-prior function is given by (13) andp(dj) is a VMGM as in (14), d Q P r,or(e,e —, Vj = 
1 , 2, ...,p, is then given by 


dQprioriO,^) 


dOj 


-%E[\^\ef]\ , 


(18) 


where E [A - 1 \9j ] is the expectation obtained from 





= —k 



Equivalently, we obtain 


where 


K (0 


dQprior (0,0 W ) 


CW 



= -K^e, 

Up 



(19) 


( 20 ) 

( 21 ) 
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Proof: See [121. □ 

~ (*) 

Since the system is defined as a linear regression, it yields the derivative of Qml{9,9 ) with respect to 9 in the 

form: 

nn,, r (a a'W _ p[v ,, yi„ 

( 22 ) 


d0 


E[X T \y, 0 (i) ]y - E[X T X\y , 0 (i) ]0 




Then, combining (18) and (22), we obtain the derivative of <2 map-em( 0,0 ) with respect to 6 as: 

„ (j) 

^Qmap-em(^)^ ) _ -|_ k^)G + aS l \ 

dO 


(23) 


where B^ = E[X T X\y,9^ ^j/cr 2 , K ^ is defined in (21) and = E[X T \y,9 { ^ y/a 2 . In particular, the 
(component-wise) derivative in Lemma 3 for the £ q —norm of 9 is given by 


(0, 


dQ P rior(9,9 {t) ) 

dO, 


0) 


9-1 


sign# 0 ) 


0 


J ’0,. 


(24) 


Remark 1 Notice that the MAP-EM algorithm yields an optimization problem whose solution is given by a system 
of linear equations, and hence, it is easy to solve 2 . This is due to the quadratic form of Q pr i or {9. 9 ) with respect 


to 9. 


V 


Remark 2 ([12]) In our approach, we have to deal with derivatives of functions of the form d\a\/da, and the 
derivative at zero does not exist. However, it is possible to obtain the desired value at almost all points using 

—jJ- = sign(a), Va / 0. (25) 

d a 

Most of the algorithms promoting sparsity have to deal with non-differentiable cost functions, see, e.g., [10,25], where 
a local approximation of the £\-norm is obtained. A common solution to overcome the issue of non-differentiability 
when a —> 0 in (25), is to start the algorithm from a value far from zero, and then remove the component a from 
the parameter vector when it enters in a neighbourhood (defined by the user) of a = 0 [11,18]. V 


5 Simulations 


In this section, we provide a numerical illustration for the applicability of our framework. For the simulations, we 
consider a limited number of measurement points, since, in general, MAP estimates converge to ML estimates as 
the number of data points increases. This is known as the Bernstein-Von Mises theorem [15]. In our experience, 
we have been able to observe this situation for the MAP-EM algorithm with the £ q -norm, 0 < q < 1. 

For the example, we consider the following system model: 


Zk+l T (hi 


Vk = sin 


/ u>(k — 1 ) 



'^'kik+p—l 9 + ilk, 


where Uk-.k+p- 1 = [uk, itfc+i, •••, u k+p-i] T is a known input, 9 = [—0.77, —1.55, 0, 0,0, 0,0.46] T is the unknown 
sparse parameter vector, N = 256 is the number of measurement points, uj = 5 is a known constant parameter, 
fk ~ AC(0,0.1), and r/k ~ A/"(0,0.1). For the estimation of 9, we consider the f^-norm as presented in this paper, 
with q = 0.1 and r = 0.1. For comparison purposes, we also consider the corresponding penalized ML problem 
and solve for it using the EM framework in Section 2.1 with q = 0.1 and r = 0.1 as well. Finally, we also consider 
the approach in [12] with r = 1, which corresponds to q = 1. Notice that the initial estimate was the same for the 
four estimation techniques. 

The results of the estimation algorithms are shown in Figs. 1—4. From those figures, we can clearly see that 
the proposed approach can effectively estimate all the zeros of 9. However, the convergence is not achieved at the 
same time for each zero-entry of 9. Our approach clearly outperforms the fd-norm MAP-EM approach and ML. 
In contrast, the CD approach can accurately estimate the zeros of 9 at the first iteration of the EM algorithm. 
However, this estimation can vary from iteration to iteration, since the optimization procedure is carried out 

-When compared to the equivalent nonlinear problems obtained by directly applying Lasso or the bridge regression [10] with 
0 < q < 1. 
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Figure 1: Convergence of the estimates of the first zero of 9 per iteration of the EM algorithm. 



Figure 2: Convergence of the estimates of the second zero of 9 per iteration of the EM algorithm. 


globally for the a single coordinate. As mentioned in [5], this kind of optimization should be carried out locally. 
From Figs. 1 and 2, we can see that at iteration no. 34, the zeros are correctly estimated. However, at iteration no. 
35, those estimates are no longer zero. In contrast, our proposed method forces the zero value once the estimate 
converges to zero, following the ideas in Remark 2. In addition, the mean square error (MSE) of the estimates are: 
i) 3.56 x 10 -5 for ML, ii) 1.124 x 10 -4 for MAP-EM with q = 1, iii) 1.96 x 10 -5 for MAP-EM with q = 0.1, and 
iv) 3.48 x 10” 5 for CD. The difference in MSE is due to the slow convergence of the non-zero elements that CD 
exhibits. Thus, we believe that our estimation algorithm could be initialized with one iteration of CD for quicker 
convergence of the EM algorithm. 


6 Conclusions 

In this paper, we have described a general framework for the implementation of the EM algorithm based on mean- 
variance Gaussian mixtures, for promoting sparsity via £ q —norm regularization (0 < q < 1). Our proposal allows a 
(computationally-efficient) simple way of handling the sparsity estimation problem, yielding closed form expressions 
for the EM algorithm. A particular case of the proposed approach includes q = 1 (Lasso). The numerical illustration 
shows the effectiveness of our approach, outperforming both the ML and the Lasso estimators when choosing q < 1. 
In addition, the simulations show that our approach can obtain better estimates than Coordinate Descent within 
the EM framework. Since the Coordinate Descent algorithm is capable of finding the zeros quickly but may diverge, 
it can be used to initialize our approach for faster convergence. 
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Figure 3: Convergence of the estimates of the third zero of 0 per iteration of the EM algorithm. 



Figure 4: Convergence of the estimates of the fourth zero of 6 per iteration of the EM algorithm. 
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